Fault Injection Techniques and Tools


Fault injection is important to evaluating the dependability of computer 
systems. Researchers and engineers have created many novel methods to 
inject faults, which can be implemented in both hardware and software. 

Dependability evaluation involves the study of
failures and errors. The destructive nature of 
a crash and long error latency make it difficult 
to identify the causes of failures in the operational 
environment. It is particularly hard to recreate a 
failure scenario for a large, complex system. 

To identify and understand potential failures, we 
use an experiment-based approach for studying the
dependability of a system. Such an approach is 
applied not only during the conception and design 
phases, but also during the prototype and operational
phases.

To take an experiment-based approach, we must
first understand a system’s architecture, structure, 
and behavior. Specifically, we need to know its tol- 
erance for faults and failures, including its built-in 
detection and recovery mechanisms, 3 and we need 
specific instruments and tools to inject faults, create 
failures or errors, and monitor their effects. 

DIFFERENT PHASES, DIFFERENT TECHNIQUES 

Engineers most often use low-cost, simulation- 
based fault injection to evaluate the dependability 
of a system that is in the conceptual and design 
phases. At this point, the system under study is only 
a series of high-level abstractions; implementation 
details have yet to be determined. Thus the system 
is simulated on the basis of simplified assumptions. 


Simulation-based fault injection, which assumes
that errors or failures occur according to a predetermined
distribution, is useful for evaluating the effec-



measurements. Testing a prototype, on the other 
hand, allows us to evaluate the system without any 
assumptions about system design, which yields more 
accurate results. In prototype-based fault injection, 
we inject faults into the system to 

• identify dependability bottlenecks, 

• study system behavior in the presence of faults, 

• determine the coverage of error detection and 
recovery mechanisms, and 

• evaluate the effectiveness of fault tolerance 
mechanisms (such as reconfiguration schemes) 
and performance loss. 
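
As a minimal sketch of how the third goal might be quantified, the snippet below estimates detection coverage as the fraction of activated faults that were detected. The record layout (`activated`, `detected`) and the function name are assumptions made for illustration; they do not come from any tool discussed in this article.

```python
from dataclasses import dataclass


@dataclass
class InjectionResult:
    """One injection experiment (hypothetical record layout)."""
    activated: bool  # the injected fault actually produced an error
    detected: bool   # an error detection mechanism flagged that error


def detection_coverage(results):
    """Return detected / activated, or None when no fault was activated."""
    activated = [r for r in results if r.activated]
    if not activated:
        return None
    return sum(r.detected for r in activated) / len(activated)


results = [
    InjectionResult(activated=True, detected=True),
    InjectionResult(activated=True, detected=False),
    InjectionResult(activated=False, detected=False),  # never activated, so excluded
    InjectionResult(activated=True, detected=True),
    InjectionResult(activated=True, detected=True),
]
coverage = detection_coverage(results)
```

Faults that are never activated are left out of the denominator here; whether to count them is a design choice that depends on the measure being reported.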


To do prototype-based fault injection, faults are
injected either at the hardware level (logical or elec- 
trical faults) or at the software level (code or data 
corruption) and the effects are monitored. The sys- 
tem used for evaluation can be either a prototype or 
a fully operational system. Injecting faults into an 
operational system can provide information about 
the failure process. However, fault injection is suit- 
able for studying emulated faults only. It also fails 
to provide dependability measures such as mean 
time between failures and availability. 

Instead of injecting faults, engineers can directly 
measure operational systems as they handle real 
workloads. 2 Measurement-based analysis uses actual
data, which contains much information about naturally
occurring errors and failures and sometimes
about recovery attempts. Analyzing these data can
provide understanding of actual error and failure
characteristics and insight for analytical models.
However, measurement-based analysis is limited to
detected errors. Furthermore, data must be collected
over a long time because errors and failures occur

Figure 1. Basic components of a fault injection environment.



infrequently. Field conditions can vary widely, thus
casting doubt on the statistical validity of the result.

Although each of the three experimental methods
has its limitations, their unique values complement
one another and allow for a wide spectrum of
dependability studies.

FAULT INJECTION TECHNIQUES

Engineers use fault injection to test fault-tolerant
systems or components. Fault injection tests fault
detection, fault isolation, and reconfiguration and
recovery capabilities.

Fault injection environment

Figure 1 shows a fault injection environment, which
typically consists of the target system plus a fault
injector, fault library, workload generator, workload library,
controller, monitor, data collector, and data analyzer.
The fault injector injects faults into the target system
as it executes commands from the workload generator
(applications, benchmarks, or synthetic workloads).
The monitor tracks the execution of the commands and
initiates data collection whenever necessary. The data
collector performs online data collection, and the data
analyzer, which can be offline, performs data processing
and analysis. The controller controls the experiment.

Physically, the controller is a program that can run
on the target system or on a separate computer. The
fault injector can be custom-built hardware or software.
The fault injector itself can support different
fault types, fault locations, fault times, and appropriate
hardware semantics or software structure, the
values of which are drawn from a fault library. The
fault library in Figure 1 is a separate component,
which allows for greater flexibility and portability.
The workload generator, monitor, and other components
can be implemented the same way.

Injection method and implementation

Choosing between hardware and software fault
injection depends on the type of faults you are interested
in and the effort required to create them. For
example, if you are interested in stuck-at faults (faults
that force a permanent value onto a point in a circuit),
a hardware injector is preferable because you can control
the location of the fault. The injection of permanent
faults using software methods either incurs a high
overhead or is impossible, depending on the fault.
However, if you are interested in data corruption, the
software approach might suffice. Some faults, such as
bit-flips in memory cells, can be injected by either
method. In a case like this, additional requirements,
such as cost, accuracy, intrusiveness, and repeatability,
may guide the choice of approach. Table 1 summarizes
commonly studied faults and injection
methods.

HARDWARE FAULT INJECTION

Hardware-implemented fault injection uses additional
hardware to introduce faults into the target system’s
hardware. Depending on the faults and their
locations, hardware-implemented fault injection methods
fall into two categories:

• Hardware fault injection with contact. The injector
has direct physical contact with the target system,
producing voltage or current changes
externally to the target chip. Examples are methods
that use pin-level probes and sockets.

• Hardware fault injection without contact. The
injector has no direct physical contact with the
target system. Instead, an external source produces
some physical phenomenon, such as heavy-ion
radiation and electromagnetic interference,
causing spurious currents inside the target chip.

These methods are well suited for studying the 
dependability characteristics of prototypes that 
require high time-resolution for hardware triggering 
and monitoring (fault latency in the CPU, for exam- 
ple) or require access to locations that cannot be eas- 
ily reached by other fault injection methods. 

Engineers generally model hardware methods on 
low-level fault models; for example, a bridging fault
might be a short circuit. Hardware also triggers faults 
and monitors their impact, thus providing high time- 
resolution and low perturbation. Normally, the hard- 
ware triggers faults after a specified time has expired 
on a hardware timer or after it has detected an event, 
such as a specified address on the address bus. 
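
The two trigger conditions just described (timer expiry or an observed event such as an address match) can be modeled in software for illustration. This is a rough simulation, not a description of any real injector's logic; the function `make_trigger`, its parameters, and the bus trace are all hypothetical.

```python
def make_trigger(timeout_cycles=None, match_address=None):
    """Build a trigger that fires on timer expiry or on an address match.

    Both parameters are illustrative; a real hardware injector would
    implement the same conditions with a timer and a bus comparator.
    """
    def should_fire(cycle, address_bus):
        timer_expired = timeout_cycles is not None and cycle >= timeout_cycles
        address_seen = match_address is not None and address_bus == match_address
        return timer_expired or address_seen
    return should_fire


# Hypothetical bus trace: (cycle number, address observed on the address bus).
trace = [(0, 0x1000), (1, 0x1004), (2, 0x2000), (3, 0x1008)]

on_address = make_trigger(match_address=0x2000)
on_timer = make_trigger(timeout_cycles=3)

# First cycle at which each trigger would start an injection.
first_address_hit = next(c for c, a in trace if on_address(c, a))
first_timer_hit = next(c for c, a in trace if on_timer(c, a))
```

With this trace, the address trigger fires at cycle 2 and the timer trigger at cycle 3.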

Injection with contact 

Hardware fault injection using direct contact with 
circuit pins, often called pin-level injection, is probably
the most common method of hardware-implemented
fault injection. There are two main
techniques for altering electrical currents and volt- 
ages at the pins: 

• Active probes. This technique adds current via
the probes attached to the pins, altering their elec- 
trical currents. The probe method is usually lim- 
ited to stuck-at faults, although it is possible to 
attain bridging faults by placing a probe across 
two or more pins. Care must be taken when using 
active probes to force additional current into the 
target device, as an inordinate amount of current 
can damage the target hardware. 

• Socket insertion. This technique inserts a socket
between the target hardware and its circuit 
board. The inserted socket injects stuck-at, open, 
or more complex logic faults into the target hard- 
ware by forcing the analog signals that represent 
desired logic values onto the pins of the target 
hardware. The pin signals can be inverted, 
ANDed, or ORed with adjacent pin signals or
even with previous signals on the same pin. 
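
The logic-level effect of socket insertion can be illustrated with a small software model. The sketch below treats pin signals as a list of 0/1 values; the function name, the mode strings, and the choice of the next pin in the list as the "adjacent" pin are assumptions made for this example. Open faults are left out because they would need a third (floating) signal state.

```python
def inject_pin_fault(signals, pin, mode, previous=None):
    """Return a copy of `signals` (list of 0/1) with a fault forced on `pin`.

    `previous` holds the pin values from the preceding sample and is only
    needed for the "and-previous" mode.
    """
    out = list(signals)
    neighbor = signals[(pin + 1) % len(signals)]  # assumed adjacent pin
    if mode == "stuck-at-0":
        out[pin] = 0
    elif mode == "stuck-at-1":
        out[pin] = 1
    elif mode == "invert":
        out[pin] = signals[pin] ^ 1
    elif mode == "and-adjacent":
        out[pin] = signals[pin] & neighbor
    elif mode == "or-adjacent":
        out[pin] = signals[pin] | neighbor
    elif mode == "and-previous":
        if previous is None:
            raise ValueError("and-previous needs the previous sample")
        out[pin] = signals[pin] & previous[pin]
    else:
        raise ValueError(f"unknown fault mode: {mode}")
    return out


pins = [1, 0, 1, 1]
stuck = inject_pin_fault(pins, 0, "stuck-at-0")
inverted = inject_pin_fault(pins, 1, "invert")
```

The input list is never modified, so the fault-free signals remain available for comparison against the faulty ones.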

Both of these methods provide good controllabil- 
ity of fault times and locations with little or no per- 
turbation to the target system. Note that because 
faults are modeled at the pin level, they are not iden- 
tical to traditional stuck-at and bridging fault models 
that generally occur inside the chip. Nonetheless, you
can achieve many of the same effects, like the exercise
of error detection circuits, using these injection methods.
Active probes attached to the power supply hardware
inject power supply disturbance faults. However,
this approach risks damaging the injected device.

Table 1. Fault-injection implementation methods by fault model.

Hardware            Software
Open                Storage data corruption
Bridging              (such as register, memory, and disk)
Bit-flip            Communication data corruption
Spurious current      (such as bus and communication network)
Power surge         Manifestation of software defects
Stuck-at              (such as machine level and higher levels)


Injection without contact

These faults are injected by creating heavy-ion radi- 
ation. An ion passes through the depletion region of 
the target device and generates current. Placing the 
target hardware in or near an electromagnetic field 
also injects faults. Engineers like these methods 
because they mimic natural physical phenomena. 
However, it is difficult to exactly trigger the time and 
location of a fault injection using this technique 
because you cannot precisely control the exact 
moment of heavy-ion emission or electromagnetic 
field creation. 


Selected tools

Messaline, 4 developed at LAAS-CNRS, in Toulouse, 
France, uses both active probes and sockets to con- 
duct pin-level fault injection. Figure 2 on the next page 
shows Messaline’s general architecture and its envi- 
ronment. Messaline can inject stuck-at, open, bridg- 
ing, and complex logical faults, among others. It can 
also control the length of fault existence and the fre- 
quency. Signals collected from the target system can 
provide feedback to the injector. Also, a device is asso- 
ciated with each injection point to sense when and if 
each fault is activated and produces an error. Messaline can
also inject faults at up to 32 injection points simultaneously.
This tool has been used in experiments on a central- 
ized, interlocking system employed in a computerized 
railway control system and on a distributed system 
for the Esprit Delta-4 Project. 

Figure 2. General architecture of Messaline.

FIST 5 (Fault Injection System for Study of Transient
Fault Effect), developed at the Chalmers University of
Technology in Sweden, employs both contact and contactless
methods to create transient faults inside the
target system. This tool uses heavy-ion radiation to
create transient faults at random locations inside a
chip when the chip is exposed to the radiation and
can thus cause single- or multiple-bit flips. The radiation
source is mounted inside a vacuum chamber
together with a small two-processor computer system.
The computer is positioned so that one of the
processors is exposed directly under the radiation.
The other processor is used as a reference for detecting
whether the radiation results in any bit-flips.
Figure 3 illustrates the FIST environment.

FIST can inject faults directly inside a chip, which 
cannot be done with pin-level injections. It can pro- 
duce transient faults at random locations evenly in a 
chip, which leads to a large variation in the errors seen 
on the output pins. In addition to radiation, FIST 
allows for the injection of power disturbance faults. 
This is done by placing a MOS transistor between the 
power supply and the Vcc pin of the processor chip to
control the amplitude of the voltage drop. Power sup- 
ply disturbances usually affect multiple locations within 
a chip and can cause gate propagation delay faults. The 
experimental results show that the errors resulting from 
both methods cause similar effects on program con- 
trol-flow and data errors. However, heavy-ion radia- 
tion causes mostly address bus errors, while power 
supply disturbances affect mostly control signals. 

MARS 6 (Maintainable Real-Time System) is a dis- 
tributed, fault-tolerant architecture developed at the 
Technical University of Vienna. In addition to using 
heavy-ion radiation as is used in FIST, MARS uses 
electromagnetic fields to conduct contactless fault 
injection. A circuit board placed between two charged
plates or a chip placed near a charged probe causes 
fault injection. Dangling wires that act as antennas 
placed on individual chip pins accentuate the electro- 
magnetic field effect on those pins. Researchers com- 
pared these three methods (heavy-ion radiation, 
pin-level injection, and electromagnetic interference)
in terms of their capability to exercise the MARS error
detection mechanisms. Results showed that the three 
methods are complementary and generate different 
types of errors. Pin-level injections cause error detec- 
tion mechanisms outside the CPU to be exercised more 
effectively than heavy-ion radiation or electromag- 
netic interference. The latter two methods were bet- 
ter suited for exercising software and application-level 
error detection mechanisms. 

SOFTWARE FAULT INJECTION 

In recent years, researchers have taken more interest
in developing software-implemented fault injection
tools. Software fault-injection techniques are
attractive because they don’t require expensive hardware.
Furthermore, they can be used to target applications
and operating systems, which is difficult to do
with hardware fault injection.

If the target is an application, the fault injector is 
inserted into the application itself or layered between 
the application and the operating system. If the target 
is the operating system, the fault injector must be 
embedded in the operating system, as it is very difficult
to add a layer between the machine and the operating 
system. 

Although the software approach is flexible, it has 
its shortcomings. 

• It cannot inject faults into locations that are inac- 
cessible to software. 

• The software instrumentation may disturb the 
workload running on the target system and even 
change the structure of original software. Careful 
design of the injection environment can minimize 
perturbation to the workload. 
Figure 3. FIST environment.



• The poor time-resolution of the approach may 
cause fidelity problems. For long latency faults, 
such as memory faults, the low time-resolution 
may not be a problem. For short latency faults, 
such as bus and CPU faults, the approach may fail 
to capture certain error behavior, like propagation. 
Engineers can solve this problem by taking a 
hybrid approach, which combines the versatility
of software fault injection and the accuracy of 
hardware monitoring. The hybrid approach is well 
suited for measuring extremely short latencies. 
However, the hardware monitoring involved can 
cost more and decrease flexibility by limiting 
observation points and data storage size. 

We can categorize software injection methods on
the basis of when the faults are injected: during compile-time
or during runtime.

Compile-time injection

To inject faults at compile-time, the program
instructions must be modified before the program
image is loaded and executed. Rather than injecting
faults into the hardware of the target system, this
method injects errors into the source code or assembly
code of the target program to emulate the effect
of hardware, software, and transient faults. The modified
code alters the target program instructions, causing
injection. Injection generates an erroneous software
image, and when the system executes the faulty
image, it activates the fault.

This method requires the modification of the pro- 
gram that will evaluate fault effect, and it requires no 
additional software during runtime. In addition, it 
causes no perturbation to the target system during 
execution. Because the fault effect is hard-coded, engi- 
neers can use it to emulate permanent faults. This 
method’s implementation is very simple, but it does 
not allow the injection of faults as the workload pro- 
gram runs. 
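
A minimal sketch of the compile-time idea follows, using Python's standard `ast` module to alter a program before it is compiled and loaded. The target function `checksum`, the `AddToSub` mutation, and the `build` helper are illustrative assumptions; the tools discussed in this article work on source or assembly code for their own target systems, not on Python.

```python
import ast

# Hypothetical target program, held as source text.
SOURCE = """
def checksum(values):
    total = 0
    for v in values:
        total = total + v
    return total
"""


class AddToSub(ast.NodeTransformer):
    """Emulate a fault by turning every `+` into `-` before loading."""

    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node


def build(source, transformer=None):
    """Parse, optionally mutate, compile, and load the target program."""
    tree = ast.parse(source)
    if transformer is not None:
        tree = ast.fix_missing_locations(transformer.visit(tree))
    namespace = {}
    exec(compile(tree, "<target>", "exec"), namespace)
    return namespace["checksum"]


golden = build(SOURCE)              # fault-free image
faulty = build(SOURCE, AddToSub())  # erroneous image with the fault hard-coded
```

Because the fault is baked into the loaded image, every call to `faulty` exercises it, which matches the permanent-fault behavior described above. For the input `[1, 2, 3]`, `golden` returns 6 and `faulty` returns -6.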

Runtime injection

During runtime, a mechanism is needed to trigger 
fault injection. Commonly used triggering mecha- 
nisms include: 

• Time-out. In this simplest of techniques, a timer
expires at a predetermined time, triggering injection.
Specifically, the time-out event generates an
interrupt to invoke fault injection. The timer
can be a hardware or software timer. This
method requires no modification to the application
or workload program. A hardware timer
must be linked to the system’s interrupt handler
vector. Since it injects faults on the basis of time
rather than specific events or system state, it produces
unpredictable fault effects and program
behavior. It is, however, suitable for emulating
transient faults and intermittent hardware faults.

Figure 4. Ftape environment.

• Exception/trap. In this case, a hardware exception
or a software trap transfers control to the
fault injector. Unlike time-out, exception/trap can
inject the fault whenever certain events or conditions
occur. For example, a software trap
instruction inserted into a target program will
invoke the fault injection before the program executes
a particular instruction. When the trap executes,
an interrupt is generated that transfers
control to an interrupt handler. A hardware
exception invokes injection when a hardware-observed
event occurs (when a particular memory
location is accessed, for example). Both
mechanisms must be linked to the interrupt handler
vector.

• Code insertion. In this technique, instructions are 
added to the target program that allow fault 
injection to occur before particular instructions, 
much like the code-modification method. Unlike 
code modification, code insertion performs fault 
injection during runtime and adds instructions 
rather than changing original instructions. Unlike 
the trap method, the fault injector may exist as 
part of the target program and run at user mode 
rather than system mode. 
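
Code insertion can be approximated in a high-level language by wrapping the target routine so that injection code runs in user mode just before it. The sketch below is one such approximation under stated assumptions: the decorator `inject_before`, the call-count trigger, and the target function `double` are hypothetical and stand in for instructions that a real tool would insert into the program.

```python
import functools


def flip_bit(value, bit):
    """Flip one bit of an integer."""
    return value ^ (1 << bit)


def inject_before(trigger, bit):
    """Insert injection code that runs before the wrapped function.

    `trigger(call_number)` decides whether this call is corrupted; when it
    returns True, one bit of the first argument is flipped.
    """
    def decorator(func):
        calls = 0

        @functools.wraps(func)
        def wrapper(x, *args, **kwargs):
            nonlocal calls
            calls += 1
            if trigger(calls):
                x = flip_bit(x, bit)  # transient fault: this call only
            return func(x, *args, **kwargs)
        return wrapper
    return decorator


@inject_before(trigger=lambda n: n == 3, bit=0)
def double(x):
    return 2 * x


outputs = [double(5) for _ in range(4)]
```

Only the third call sees the corrupted argument (5 becomes 4), so `outputs` is `[10, 10, 8, 10]`. The original function body is untouched, which mirrors the distinction drawn above between adding instructions and changing them.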

Selected tools

Ferrari 7 (Fault and Error Automatic Real-Time
Injection), developed at the University of Texas at
Austin, uses software traps to inject CPU, memory,
and bus faults. Ferrari consists of four components:
the initializer and activator, the user information, the
fault-and-error injector, and the data collector and
analyzer.

The fault-and-error injector uses software traps and
trap-handling routines. Software traps are triggered
either by the program counter when it points to the 
desired program locations or by a timer. When the 
traps are triggered, the trap handling routines inject 
faults at the specific fault locations, typically by chang- 
ing the content of selected registers or memory loca- 
tions to emulate actual data corruptions. The faults 
injected can be those permanent or transient faults
that result in an address line error, a data line error,
or a condition bit error.

Experiments conducted on Sun SparcStations 
showed that error detection is highly dependent on 
the fault type. Faults in the task memory resulted in 
the highest level of detection, due mainly to the 
repeated injection of faults when trap instructions 
were placed in program loops. Also, many faults 
injected into I/O routines and system libraries went 
undetected because these routines were less frequently 
exercised. 7 

The Fault Tolerance and Performance Evaluator 
(Ftape), 8 developed at the University of Illinois, con- 
sists of the components shown in Figure 4. Engineers 
can inject faults into user-accessible registers in CPU
modules, memory locations, and the disk subsystem. 
The faults are injected as bit-flips to emulate errors
that result from faults.

Disk system faults are injected by executing a rou- 
tine in the driver code that emulates I/O errors (bus 
error and timer error, for example). Fault injection dri- 
vers added to the operating system inject the faults, 
so no additional hardware or modification of appli- 
cation code is needed. A synthetic workload genera- 
tor creates a workload containing specified amounts 
of CPU, memory, and I/O activity, and faults are 
injected with a strategy that considers the character- 
istics of the workload at the time of injection (which 
components are experiencing the greatest amount of 
workload activity, for example). Ftape has been used 
on several Tandem fault-tolerant computers and serves
as the basis of a benchmark for fault tolerance, which 
measures the occurrence of system failures and the 
amount of performance degradation under fault con- 
ditions. 

Doctor 9 (Integrated Software Fault Injection 
Environment), developed at the University of 
Michigan, allows injection of CPU faults, memory 
faults, and network communication faults. It uses three 
triggering methods — time-out, trap, and code modifi- 
cation — to trigger fault injection. Time-out triggers 
memory fault injection. Once time-out occurs, it trig- 
gers the fault injector to overwrite the memory con- 
tent to emulate occurrence of a memory fault. For 
nonpermanent CPU faults, traps trigger fault injection. 
For permanent CPU faults, fault injection is done by 
changing program instructions during compilation to 
emulate instruction and data corruptions due to the 
faults. Doctor has been used on Harts, a distributed, 
real-time system, to investigate the effect of intermit- 
tent message losses between two adjacent nodes and 
the effect of routing using failure data. The researchers 
used experimental results to validate a message delivery
model and to evaluate different message delivery
methods. 
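
To make the idea of intermittent message loss between two nodes concrete, here is a small simulation. It is not Doctor's mechanism; the function name, the independent per-message loss model, and the `seed` parameter are assumptions chosen so that an experiment can be repeated exactly.

```python
import random


def send_with_intermittent_loss(messages, loss_probability, seed=0):
    """Simulate a link that drops each message with a fixed probability.

    A seeded generator keeps the loss pattern reproducible across runs.
    """
    rng = random.Random(seed)
    delivered, lost = [], []
    for msg in messages:
        if rng.random() < loss_probability:
            lost.append(msg)
        else:
            delivered.append(msg)
    return delivered, lost


delivered, lost = send_with_intermittent_loss(range(100), 0.2, seed=42)
repeat_delivered, repeat_lost = send_with_intermittent_loss(range(100), 0.2, seed=42)
```

Running the same seed twice gives the same set of lost messages, so different delivery methods can be compared against an identical fault pattern.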

Xception, 10 developed at the University of 
Coimbra in Portugal, takes advantage of the 
advanced debugging and performance monitoring 
features present in many modern processors to inject 
more realistic faults. It requires no modification in 
application software and no insertion of software 
traps. Xception, in fact, uses a processor’s built-in 
hardware exception triggers to trigger fault injection. 
The fault injector is implemented as an exception 
handler and requires modification of the interrupt 
handler vector. Xception faults are triggered based 
on access to specific addresses (rather than on a time 
period following an event), so the experiments are 
reproducible. The following events can trigger fault 
injection: 

• opcode fetch from a specified address, 

• operand load from a specified address, 

• operand store to a specified address, 

• a specified time passed since start-up, and 

• a combination of the above fault triggers. 

Each fault has a specified fault mask: a set of bits 
that determines which corresponding bits in the tar- 
get location will be injected. Bits in the fault mask set 
to 1 can use several bit-level operations: stuck-at-zero, 
stuck-at-one, bit-flip, and bridging. Xception has been 
implemented on a Parsytec parallel machine based on 
the PowerPC 601 processor. Experiments revealed the 
deficiency in the error detection mechanisms by show- 
ing that up to 73 percent of injected faults resulted in 
incorrect results that were undetected for certain 
processor functional units. 
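
The fault-mask idea can be sketched in a few lines. This is an illustration only and not Xception's implementation; the function name `apply_fault`, the `width` parameter, and the operation strings are assumptions. Bridging is omitted because the text does not define its bit-level semantics.

```python
def apply_fault(value, mask, operation, width=32):
    """Apply a bit-level fault to the bits of `value` selected by `mask`.

    Bits set to 1 in `mask` are the ones affected; all other bits are
    left unchanged. `width` is the assumed size of the target location.
    """
    full = (1 << width) - 1
    value &= full
    mask &= full
    if operation == "stuck-at-zero":
        return value & ~mask & full
    if operation == "stuck-at-one":
        return value | mask
    if operation == "bit-flip":
        return value ^ mask
    raise ValueError(f"unsupported operation: {operation}")


original = 0b1010
mask = 0b0110
forced_low = apply_fault(original, mask, "stuck-at-zero")   # 0b1000
forced_high = apply_fault(original, mask, "stuck-at-one")   # 0b1110
flipped = apply_fault(original, mask, "bit-flip")           # 0b1100
```

Keeping the mask separate from the operation lets one fault location be reused with different fault models.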

Table 2 classifies the hardware and software fault 
injection methods. 

The contrast between the hardware and software
methods lies mainly in the fault injection points 
they can access, cost, and level of perturbation. 
Hardware methods can inject faults into chip pins 
and internal components, such as combinational cir- 
cuits and registers that are not software-addressable. 
On the other hand, software methods are convenient 
for directly producing changes at the software-state 
level (memory, register, for example). Thus, we use 
hardware methods to evaluate low-level error detec- 
tion and masking mechanisms and software meth- 
ods to test higher level mechanisms. Software 
methods are less expensive, but they also incur a 
higher perturbation overhead because they execute 
software on the target system.